Abstract

Approximating the length of the longest increasing sequence (LIS) of a data stream is a
well-studied problem. There are many algorithms that estimate the size of the complement of the
LIS, referred to as the distance to monotonicity, both in the streaming and property testing
settings. Let $n$ denote the size of an input array. Our aim is to develop a one-pass streaming
algorithm that accurately approximates the distance to monotonicity and uses only polylogarithmic
storage. For any $\delta > 0$, our algorithm provides a $(1 + \delta)$-multiplicative approximation
for the distance, and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation
using poly-logarithmic space was a multiplicative 2-factor. Our algorithm is simple and natural,
being just 3 lines of pseudocode. It is essentially a polylogarithmic space implementation of a
classic dynamic program that computes the LIS. This leads to additive $\delta n$-approximations
with poly-logarithmic space, for any constant $\delta$. Previously, it was not known how to get
such bounds for $\delta < 1/2$.

Our technique is more general and is applicable to other problems that are exactly solvable
by dynamic programs. For example, we are able to get a streaming algorithm for the longest
common subsequence problem (in the asymmetric setting introduced in [AKO10]) whose space
is small on instances where no symbol appears very many times. Consider two strings $x$ and $y$
(of length $n$). The string $y$ is known to us, and we only have streaming access to $x$. The size
of the complement of the LCS is the edit distance between $x$ and $y$ with only insertions and
deletions. If no symbol occurs more than $k$ times in $y$, we get an $O(k(\log^2 n)/\delta)$-space
streaming algorithm that provides a $(1 + \delta)$-multiplicative approximation for the LCS
complement. In general, we also provide a deterministic 1-pass streaming algorithm that outputs a
$(1 + \delta)$-multiplicative approximation for the LCS complement and uses
$O(\sqrt{(n \log n)/\delta})$ space. All these algorithms are based on small space streaming
algorithms that follow a dynamic program.
1 Introduction



Finding the longest increasing subsequence (LIS) and longest common subsequence (LCS) of arrays
is a classic algorithmic problem. Consider a string $x$ over the alphabet $\Sigma$, where $x$ is
represented as a function $x : [n] \to \Sigma$. A subsequence is the string
$x(i_1)x(i_2) \cdots x(i_k)$, where $1 \le i_1 < i_2 < \cdots < i_k \le n$. If the alphabet
$\Sigma$ comes equipped with a (total or partial) order $<$, we say that the subsequence is
increasing if for all $j$, $x(i_j) < x(i_{j+1})$. Given strings $x$ and $y$, a common subsequence
is a string that is a subsequence of both strings. The standard optimization problem for LIS and
LCS is to find the appropriate longest subsequence. Note that the LIS of $x$ is the LCS of $x$ and
its sorted version.

These problems have standard dynamic programming solutions. The LIS can be found in
$O(n \log n)$ time [Sch61, Fre75, AD99]. This is known to be optimal, even for algorithms that
only determine the length of the LIS [Ram97]. The LCS problem has a fairly direct $O(n^2)$
algorithm [CLRS00], which can be improved to $O(n^2/\log^2 n)$ [MP80, BFC08]. It is a notoriously
difficult open problem to improve this bound, or to prove a matching lower bound.

It is often natural to focus on the complements of the LIS and LCS lengths, which are related to
some notion of distance between strings. The size of the complement of the LIS is called the
(edit) distance to monotonicity: the minimum number of values that need to be changed to make $x$
monotonically increasing (conventionally, this is represented as a fraction of $n$). The size of
the complement of the LCS is the edit distance between $x$ and $y$ when only insertions and
deletions are allowed. The standard notion of Levenshtein distance, where insertions, deletions,
and substitutions are allowed, is at least half of this edit distance. Naturally, computing these
complements exactly is the same as LIS or LCS computations, but approximating these quantities can
be very different.

In recent years, there has been a lot of attention on giving approximate solutions for LIS and LCS
that are much more efficient than the basic dynamic programming solutions. Any improved results for
LCS would be very interesting, since the best known quadratic time solution is infeasible in practice.
These problems can be studied in a variety of settings: sampling, streaming, and communication.
The streaming setting has been the focus of many results [GJKK07, SW07, GG07, EJ08]. The model
for the LIS is that we are allowed one pass (or a constant number of passes) over the input string
$x$, and only have access to sublinear storage.

For LCS, the streaming model assumes that we stream through both the strings $x$ and $y$. A recent
breakthrough result on approximating edit distance [AKO10] considers an asymmetric sampling
model. Here, we have full access to the string $y$ and are not charged for reading characters of $y$.
Any access to the string $x$ is accounted for in the query complexity. This is a very helpful model
for understanding Levenshtein distance, and a step towards designing better algorithms for LCS.
Inspired by this, we consider the asymmetric streaming model. We have full access to the string $y$
and the string $x$ is streamed through.

1.1 Results 

Our first result is a streaming algorithm for accurately approximating the edit distance to
monotonicity. Previous results either gave only a 2-approximation in poly-logarithmic space [EJ08],
or used $\Omega(\sqrt{n})$ space [GJKK07]. (We note that the result of [GJKK07] gave much stronger
multiplicative approximations to the LIS in $\Omega(\sqrt{n})$ space.) The error probability of all
our algorithms is $n^{-\Omega(1)}$, unless otherwise specified. This also leads to a streaming
algorithm providing an additive $\delta n$-approximation to the LIS using poly-logarithmic space
(for any constant choice of $\delta$). The previous streaming algorithms could only provide an
additive $n/2$-approximation in the worst case. Indeed, when the LIS length was less than $n/2$,
previous algorithms would simply return zero as their LIS estimate. Hence, we stress that our
improvement from a factor 2 to $(1 + \delta)$ is not just "chipping away" at a constant, but a
fundamental improvement.
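
The step from a multiplicative guarantee on the distance to an additive guarantee on the LIS can be
spelled out in one line. In the sketch below, $d$ and $\mathrm{LIS}$ denote the (unnormalized)
distance to monotonicity and the LIS length of the stream, so $d = n - \mathrm{LIS}$, and $\hat{d}$
is a hypothetical name for the algorithm's output, assumed to satisfy the one-sided bound
$d \le \hat{d} \le (1 + \delta)d$.

```latex
\[
  \mathrm{LIS} - \delta n
  \;\le\; \mathrm{LIS} - \delta d
  \;=\; n - (1+\delta)d
  \;\le\; n - \hat{d}
  \;\le\; n - d
  \;=\; \mathrm{LIS}.
\]
```

Under that assumption, $n - \hat{d}$ never overestimates the LIS length and underestimates it by at
most $\delta n$.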

Theorem 1.1 Consider the streaming setting where the input length is $n$. For any $\delta > 0$,
there is a randomized one-pass $O(\delta^{-1} \log^2 n)$-space streaming algorithm that outputs a
$(1 + \delta)$-multiplicative approximation for the distance to monotonicity of the input stream.

We observe that this yields an additive $\delta n$-approximation to the LIS length. This is much
stronger than all previous results, which give in the worst case an additive $n/2$-approximation.
Our result is actually a corollary of a more general algorithm that finds increasing sequences in
partial orders. The main theorem is quite powerful and can also be applied to the asymmetric LCS
problem. Given two strings $x$ and $y$ of length $n$, let $d(x, y)$ be the edit distance under
insertions and deletions.

Theorem 1.2 Consider the asymmetric streaming setting where the input length is $n$ and each
symbol occurs at most $k$ times in the string in hand. For any $\delta > 0$, there is a randomized
one-pass $O(\delta^{-1} k \log^2 n)$-space streaming algorithm that outputs a
$(1 + \delta)$-multiplicative approximation to $d(x, y)$. There is also a deterministic one-pass
$O(\sqrt{(n \log n)/\delta})$-space streaming algorithm that outputs a $(1 + \delta)$-multiplicative
approximation to $d(x, y)$, regardless of $k$.

As far as we know, there are no previous algorithms for streaming LCS in this setting. Note the
strength of these bounds. The best sub-quadratic algorithms give only poly-logarithmic factor
approximations for the edit distance [AKO10]. (Our algorithm has a large update time, so our total
running time is large.) Note that at this range of approximation, both versions of edit distance
(with or without substitutions) are basically equivalent. We are able to get extremely accurate
approximations for $d(x, y)$ in this streaming setting. When $k = 1$, this corresponds to a variant
of the Ulam distance between permutations (edit distance without substitutions). We also give an
$O(\sqrt{(n \log n)/\delta})$-space algorithm that is oblivious to $k$.

1.2 Techniques 

One of the most appealing features of our result is the extreme simplicity of the basic algorithm. For
finding the LIS of an array, we can give the pseudocode in just a few lines. We define a probability
$a(i, t)$ that is approximately $1/(\delta(t - i))$. The exact formula is not complicated, but just
a little cumbersome to state here. The algorithm maintains a set of indices $R$, and for each
$i \in R$, we store an estimate $r(i)$ (of the complement of the length of the LIS ending at $i$)
and the number $x(i)$ of the array. For convenience, we add dummy elements $x(0) = -\infty$ and
$x(n + 1) = +\infty$, and begin with $R = \{0\}$. For each time $t \ge 1$, we perform the following
update:

1. Define $R' = \{i \in R \mid x(i) < x(t)\}$. Set $r(t) = \min_{i \in R'} (r(i) + t - 1 - i)$.

2. $R \leftarrow R \cup \{t\}$.

3. Remove each $i \in R$ independently with probability $a(i, t)$.

The final output is $r(n + 1)/n$, and is guaranteed to be a one-sided $(1 + \delta)$-approximation
to the distance to monotonicity of $x$. The algorithm is not entirely intuitive, but is incredibly
simple in terms of computation. The total amount of space is the maximum size of $R$, which we can
bound above by $O(\delta^{-1} \log^2 n)$ with high probability.
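
The three update steps translate almost line for line into code. The following Python sketch is
illustrative only: the function name `distance_to_monotonicity`, the `removal_prob` callback
(standing in for the removal probability, whose exact formula is not given in this section), and
the choice to never drop the dummy index 0 or the freshly inserted index are assumptions made
here, not details taken from the text.

```python
import math
import random


def distance_to_monotonicity(xs, removal_prob, seed=None):
    """One-pass estimate of the (normalized) distance to monotonicity of a non-empty xs.

    removal_prob(i, t) is a caller-supplied stand-in for the removal probability.
    """
    rng = random.Random(seed)
    n = len(xs)
    # R maps each remembered index i to (x(i), r(i)); index 0 is the dummy x(0) = -inf.
    R = {0: (-math.inf, 0)}
    # Appending +inf plays the role of the dummy element x(n + 1).
    for t, x_t in enumerate(list(xs) + [math.inf], start=1):
        # Step 1: best remembered predecessor, paying for every index skipped after it.
        r_t = min(r_i + (t - 1 - i) for i, (x_i, r_i) in R.items() if x_i < x_t)
        # Step 2: remember the new index.
        R[t] = (x_t, r_t)
        # Step 3: forget older indices at random (assumption: 0 and t are always kept).
        R = {i: rec for i, rec in R.items()
             if i in (0, t) or rng.random() >= removal_prob(i, t)}
    return R[n + 1][1] / n


# With nothing ever forgotten, the update is the exact dynamic program.
exact = distance_to_monotonicity([3, 1, 2], lambda i, t: 0.0)
# Forgetting only removes candidates from the minimum, so estimates never fall below exact.
noisy = distance_to_monotonicity([3, 1, 2], lambda i, t: 0.5, seed=1)
```

Passing a removal probability of zero recovers the linear-space exact computation, which is a
convenient baseline when experimenting with different forgetting schedules.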

All past poly-logarithmic space algorithms [GJKK07, EJ08] for LIS use combinatorial
characterizations of increasing sequences based on inversion counting [EKK+00, DGL+99, PRR06,
ACCL07]. While this is a very powerful technique, it does not lead to accurate approximations for
the LIS. Hence, these do not yield any generalizations to LCS, a much harder problem than LIS.

To get around this, we try to follow an exact dynamic program for the LIS. It is known that
there are 1-pass streaming algorithms that exactly compute the LIS (using linear space, of course)
[Fre75, AD99]. There has been an $O(\sqrt{n})$-space deterministic implementation of this, which
yields sharp approximations for the LIS length [GJKK07]. We essentially give a poly-logarithmic
space randomized implementation of this algorithm and prove that it can output sharp approximations
for the distance to monotonicity. The algorithm can be generalized to finding long chains in
partially ordered streams. Furthermore, as shown earlier, this leads to much simpler streaming
algorithms than previous results.

The streaming implementation of the exact dynamic program remembers, for every index up to the
current position, the length of the LIS ending at that index, which is a linear amount of
information. We save this
information only for a set of positions whose density decays exponentially as one moves backwards 
in time. Notice that since we are not remembering all of the information needed to run the dynamic 
program, the values we store may not match the corresponding values in the exact dynamic program. 
The main point is to show that the remembered values are unlikely to drift far away from the values 
they are intended to approximate. 

This idea of remembering selected information about the sequence that becomes sparser as one
goes back in time was first used by [GJKK07] for the inversion counting approach. Our work seems to
be the first to use this to directly mimic the dynamic program. Superficially, it may appear that
there is a connection to the sampling based algorithm that obtains a $(1 + \delta)$-approximation
to the distance to monotonicity [SS10]. That result also overcame the 2-factor barrier for
sublinear algorithms. We stress that there is little connection between this new streaming
algorithm and the sampling based one. This can be clearly seen by just looking at the algorithm:
the sampling algorithm was an extremely complex procedure, completely different from the simple
3 lines presented here.

This line of thinking can be exploited to deal with asymmetric streaming LCS. We construct a 
simple reduction of LCS to finding the longest chain in a specific partial order. This reduction has a 
streaming implementation, so the input stream can be directly seen as just elements of this resulting 
partial order. This reduction blows up the size of the input, and the size of the largest chain can 
become extremely small. If each symbol occurs k times in x and y, then the resulting partial order 
has nk elements. Nonetheless, the longest chain still has length at most n. We require very accurate 
estimates for the length of the longest chain. This is where the power of the $(1 + \delta)$-approximation
comes in. We can choose $\delta$ to be much smaller to account for the input blow up, and still get a good
approximation. Note that if we only had a 1.01-approximation for the longest chain problem, this 
reduction would not be useful. 

Our $O(\sqrt{n})$-space algorithm also works according to the basic principle of following a
dynamic program, although it uses one different from the previous algorithms. This can be thought
of as a generalization of the $O(\sqrt{n})$-space algorithm for LIS [GJKK07]. We can basically
maintain an $O(\sqrt{n})$-space deterministic sketch of the data structure maintained by the exact
algorithm. By breaking the stream up into the right number of chunks, we can update this sketch
using $O(\sqrt{n})$ space.

1.3 Previous work 

The study of LIS and LCS in the streaming setting was initiated by Liben-Nowell et al. [LNVZ05],
although their focus was mostly on exactly computing the LIS. Sun and Woodruff [SW07] improve
upon their algorithms and lower bounds and also prove bounds for the approximate version. Most
relevant for our work, they prove that randomized protocols that compute a
$(1 + \epsilon)$-approximation of the LIS length essentially require $\Omega(\epsilon^{-1} \log n)$
space. Gopalan et al. [GJKK07] provide the first poly-logarithmic space algorithm that approximates
the distance to monotonicity. This was based on inversion counting ideas in [PRR06, ACCL07]. Ergun
and Jowhari [EJ08] give a 2-approximation using the basic technique of inversion counting, but
develop a different algorithm. Gal and Gopalan [GG07] and independently Ergun and Jowhari [EJ08]
proved an $\Omega(\sqrt{n})$ lower bound for deterministic protocols that approximate the LIS
length up to a multiplicative constant factor. For randomized protocols, the Sun and Woodruff bound
of $\Omega(\log n)$ is the best known. One of the major open problems is to get an $o(\sqrt{n})$
space randomized protocol (or an $\Omega(\sqrt{n})$ lower bound) for constant factor approximations
for the LIS length. Note that our work does not imply anything non-trivial for this problem.

A significant amount of work has been done in studying the LIS (or rather, the distance to
monotonicity) in the context of property testing [EKK+00, DGL+99, Fis01, PRR06, ACCL07]. The
property of monotonicity has been studied over a variety of domains, of which the boolean hypercube
and the set $[n]$ (which is the LIS setting) have usually been of special interest [GGL+00, DGL+99,
FLN+02, HK03, ACCL07, PRR06, BGJ+09]. In previous work, the authors of this paper found a
$(1 + \delta)$-multiplicative approximation algorithm for the distance that runs in time
$O(\mathrm{poly}\log(n))$ [SS10]. In that setting, the algorithm has random access to the input but
only has time to look at a small fraction of the input (the standard property testing setting).
Note the difference from the streaming setting. The streaming algorithm in this paper is completely
different from (and much simpler than) the additive approximation algorithm.

The LCS and edit distance have an extremely long and rich history, especially in the applied
domain. We point the interested reader to [Gus97, Nav01] for more details. Andoni et al. [AKO10]
achieved a breakthrough by giving a near-linear time algorithm that gives poly-logarithmic factor
approximations for the edit distance. This followed a long line of results, which is well
documented in [AKO10]. Most interestingly, they initiated the study of the asymmetric edit distance,
where one string is known and we are only charged for accesses to the other string. For the case
of non-repetitive strings, there has been a body of work on studying the Ulam distance between
permutations [AK07, AK08, AIK09, AN10].

2 Paths in posets 

We begin by defining a streaming problem called the Approximate Minimum-defect path problem
(AMDP). We define it formally below, but intuitively, we look at the stream as a sequence of
elements from some poset. Our aim is to estimate the size of the complement of the longest chain,
consistent with the stream ordering. This is a more general problem than LIS, and we will later show
how streaming algorithms for LIS and LCS can be obtained from reductions to AMDP.



2.1 Weighted P-sequences and the approximate minimum-defect path problem 

We use $P$ to denote a fixed set endowed with a partial order $<$. The partial order relation is
given by an oracle which, given $u, v \in P$, outputs $u < v$ or $\neg(u < v)$. For a natural number
$n$ we write $[n]$ for the set $\{1, 2, \ldots, n\}$.

A sequence $\sigma = (\sigma(1), \ldots, \sigma(n)) \in P^n$ is called a $P$-sequence. The number
of terms of $\sigma$ is called the length of $\sigma$ and is denoted $|\sigma|$; we normally use
$n$ to denote $|\sigma|$. A weighted $P$-sequence consists of a $P$-sequence $\sigma$ together with
a sequence $(w(1), \ldots, w(n))$ of nonnegative integers; $w(i)$ is called the weight of index
$i$. In all our final applications $w(i)$ will always be 1. Nonetheless, we solve this slightly
more general weighted version.

We have the following additional definitions: 

• For $t \in [n]$, $\sigma_{\le t}$ denotes the sequence $(\sigma(1), \ldots, \sigma(t))$. Also
for $J \subseteq [n]$, $J_{\le t}$ denotes the set $J \cap \{1, \ldots, t\}$.

• For $J \subseteq [n]$, $w(J) = \sum_{j \in J} w(j)$.

• The digraph $D = D(\sigma)$ associated to the $P$-sequence $\sigma$ has vertex set $[n]$ (where
$n = |\sigma|$) and arc set $\{i \to j : i < j \text{ and } \sigma(i) < \sigma(j)\}$.

• A path $\pi$ in $D(\sigma)$ is called a $\sigma$-path. Such a path is a sequence
$1 \le \pi_1 < \cdots < \pi_k \le n$ of indices with $\pi_1 \to \cdots \to \pi_k$. We say that
$\pi$ ends at $\pi_k$.

• The defect of a path $\pi$, defect$(\pi)$, is defined to be $w([n] - \pi)$.

• min-defect$(\sigma, w)$ is defined to be the minimum of defect$(\pi)$ over all $\sigma$-paths $\pi$.

We now define the Approximate Minimum-defect path problem (AMDP). The input is a
weighted $P$-sequence $(\sigma, w)$, an approximation parameter $\delta > 0$, and an error
parameter $\gamma > 0$. The output is a number $A$ such that:
$\mathrm{Prob}[A \in [\text{min-defect}(\sigma, w), (1 + \delta)\,\text{min-defect}(\sigma, w)]] \ge 1 - \gamma$.
An algorithm for AMDP that has the further guarantee that $A \ge \text{min-defect}(\sigma, w)$ is
said to be a one-sided error algorithm.

2.2 Streaming algorithms and the main result 

In a one-pass streaming algorithm, the algorithm has one-way access to the input. For the AMDP,
the input consists of the parameters $\delta$ and $\gamma$ together with a sequence of $n$ pairs
$((\sigma(t), w(t)) : t \in [n])$. We think of the input as arriving in a sequence of discrete time
steps, where $\delta, \gamma$ arrive at time step 0 and, for $t \in [n]$, $(\sigma(t), w(t))$
arrives at time step $t$.

The main complexity parameter of interest is the auxiliary memory needed. For simplicity, we
assume that each memory cell can store any one of the following: a single element of $P$, an index
in $[n]$, or an arbitrary sum $w(J)$ of distinct weights. Associated to a weighted $P$-sequence
$(\sigma, w)$ we define the parameter $\rho = \rho(w) = \sum_i w(i)$. Typically one should think of
the weights as bounded by a polynomial in $n$ and so $\rho = n^{O(1)}$. The main technical theorem
about AMDP is the following.

Theorem 2.1 There is a randomized one-pass streaming algorithm for AMDP that operates with
one-sided error and uses space $O\left(\frac{\ln(n/\gamma)\ln(\rho)}{\delta}\right)$.

In particular, if $\rho = n^{O(1)}$ and $\gamma = 1/n^{O(1)}$ then the space is
$O\left(\frac{(\ln n)^2}{\delta}\right)$.

3 The algorithm 

Our streaming algorithm can be viewed as a modification of a standard dynamic programming
algorithm for exact computation of min-defect$(\sigma, w)$. We first review this dynamic program.

3.1 Exact computation of min-defect$(\sigma, w)$

It will be convenient to extend the $P$-sequence by an element $\sigma(n + 1)$ that is greater than
all other elements of $P$. Thus all arcs $j \to n + 1$ for $j \in [n]$ are present. Set
$w(n + 1) = 0$. We define sequences $s(0), \ldots, s(n + 1)$ and $W(0), \ldots, W(n + 1)$ as
follows. We initialize $s(0) = 0$ and $W(0) = 0$. For $t \in [n + 1]$:

$$W(t) = W(t - 1) + w(t)$$
$$s(t) = \min(s(i) + W(t - 1) - W(i) : i < t \text{ such that } \sigma_i \to \sigma_t).$$

Here the index $i = 0$ is always allowed, corresponding to a path that starts at $t$. Thus
$W(t) = w([t])$. It is easy to prove by induction that $s(t)$ is equal to the minimum of
$W(t) - w(\pi)$ over all paths $\pi$ whose maximum element is $t$. In particular,
min-defect$(\sigma, w) = s(n + 1)$.

The above recurrence can be implemented by a one-pass streaming algorithm that uses linear
space (to store the values of $s(t)$ and $W(t)$).
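
The recurrence can be written down directly. The Python sketch below is only an illustration under
stated assumptions: the function name `exact_min_defect` and the `less` callback (playing the role
of the order oracle) are hypothetical, index 0 is treated as a predecessor of every index, and the
final minimum plays the role of the sentinel step $t = n + 1$.

```python
import operator


def exact_min_defect(sigma, w, less):
    """Exact min-defect of the weighted sequence (sigma, w) in linear space.

    sigma[t - 1] and w[t - 1] hold sigma(t) and w(t); less(u, v) is the order oracle.
    """
    n = len(sigma)
    W = [0] * (n + 1)  # W[t] = w(1) + ... + w(t)
    s = [0] * (n + 1)  # s[t] = least weight skipped before t by a path ending at t
    for t in range(1, n + 1):
        W[t] = W[t - 1] + w[t - 1]
        s[t] = min(
            s[i] + W[t - 1] - W[i]
            for i in range(t)
            if i == 0 or less(sigma[i - 1], sigma[t - 1])
        )
    # Sentinel step t = n + 1: every index has an arc to n + 1 and w(n + 1) = 0.
    return min(s[i] + W[n] - W[i] for i in range(n + 1))


# Unit weights: the longest increasing path in 3, 1, 2 is (1, 2), so one index is dropped.
print(exact_min_defect([3, 1, 2], [1, 1, 1], operator.lt))  # prints 1
# Heavy second element: keeping the path (1, 5) skips only the two unit-weight indices.
print(exact_min_defect([1, 5, 2, 3], [1, 10, 1, 1], operator.lt))  # prints 2
```

The quadratic inner loop is kept for clarity; only the arrays `s` and `W` account for the linear
space mentioned in the text.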
3.2 The poly-log space streaming algorithm

We denote our streaming algorithm by $\Gamma = \Gamma(\sigma, w, \delta, \gamma)$. Our
approximation algorithm is a natural variant of the exact algorithm. At step $t$ the algorithm
computes an approximation $r(t)$ to $s(t)$. The difference is that rather than storing $r(i)$ and
$W(i)$ for all $i$, we store them only for an evolving subset $R$ of indices, called the active set
of indices. The amount of space used by the algorithm is proportional to the maximum size of $R$.

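We first define the probabilities $p(i, t)$. Similar quantities were defined in [GJKK07].

$$q(i, t) = \min\left\{1, \frac{1 + \delta}{\delta} \ln(4t^3/\gamma) \frac{w(i)}{W(t) - W(i - 1)}\right\}$$

$$p(i, i) = 1 \quad \text{and} \quad p(i, t) = \frac{q(i, t)}{q(i, t - 1)} \text{ for } t > i.$$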

We initialize $R = \{0\}$, $r(0) = 0$ and $W(0) = 0$. The following update is performed for each
time step $t \in [n + 1]$. The final output is just $r(n + 1)$.

1. $W(t) = W(t - 1) + w(t)$.

2. $r(t) = \min(r(i) + W(t - 1) - W(i) : i \in R \text{ such that } \sigma_i \to \sigma_t)$.

3. The index $t$ is inserted in $R$. Each element $i \in R$ is (independently) discarded with
probability $1 - p(i, t)$.
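
To make the update concrete, here is a minimal Python sketch of steps 1 to 3. It is an
illustration rather than the authors' implementation, and it rests on several assumptions: the
survival probability is taken to be
$q(i, t) = \min\{1, \frac{1+\delta}{\delta}\ln(4t^3/\gamma)\,w(i)/(W(t) - W(i-1))\}$ with
$p(i, t) = q(i, t)/q(i, t-1)$; the ratio is clamped at 1; all weights are positive;
$\gamma \le 1$; and index 0 is never discarded. The names `approx_min_defect`, `survival_prob`,
and `less` are hypothetical.

```python
import math
import operator
import random


def survival_prob(w_i, span, t, delta, gamma):
    """q(i, t), where span = W(t) - W(i - 1) is the weight of the interval [i, t]."""
    return min(1.0, (1 + delta) / delta * math.log(4 * t**3 / gamma) * w_i / span)


def approx_min_defect(stream, less, delta, gamma, seed=None):
    """One-pass estimate of min-defect; stream yields (element, weight) pairs."""
    rng = random.Random(seed)
    # active[i] = (sigma(i), w(i), W(i), r(i)); index 0 precedes every element.
    active = {0: (None, 0, 0, 0)}
    W_prev, t = 0, 0
    for elem, w_t in stream:
        t += 1
        W_t = W_prev + w_t  # step 1
        r_t = min(          # step 2: minimize over remembered predecessors only
            r_i + W_prev - W_i
            for i, (e_i, _, W_i, r_i) in active.items()
            if i == 0 or less(e_i, elem)
        )
        kept = {0: active[0], t: (elem, w_t, W_t, r_t)}  # step 3: insert t ...
        for i, rec in active.items():                    # ... then thin out the rest
            if i == 0:
                continue
            _, w_i, W_i, _ = rec
            q_now = survival_prob(w_i, W_t - W_i + w_i, t, delta, gamma)
            q_old = survival_prob(w_i, W_prev - W_i + w_i, t - 1, delta, gamma)
            if rng.random() < min(1.0, q_now / q_old):
                kept[i] = rec
        active, W_prev = kept, W_t
    # Step t = n + 1: the sentinel is above everything and has weight 0.
    return min(r_i + W_prev - W_i for (_, _, W_i, r_i) in active.values())


estimate = approx_min_defect(
    zip([3, 1, 2], [1, 1, 1]), operator.lt, delta=0.5, gamma=0.1, seed=0
)
```

Because the minimum in step 2 ranges over a subset of the indices used by the exact recurrence,
the estimate can never fall below the exact value; the randomness only affects how far above it
the estimate can drift.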

We will prove: 

Theorem 3.1 On input $(\sigma, w, \delta, \gamma)$, the algorithm $\Gamma$ satisfies:

• $r(n + 1) \ge \text{min-defect}(\sigma, w)$.

• $\mathrm{Prob}[r(n + 1) > (1 + \delta)\,\text{min-defect}(\sigma, w)] \le \gamma/2$.

• The probability that $|R|$ ever exceeds $\frac{c}{\delta}\ln(2\rho)\ln(4n^3/\gamma)$, for a
suitable absolute constant $c$, is at most $\gamma/2$.

The above theorem does not exactly give what was promised in Theorem 2.1. For the algorithm
$\Gamma$, there is a small probability that the set $R$ exceeds the desired space bound, while
Theorem 2.1 promises an upper bound on the space used. To achieve the guarantee of Theorem 2.1 we
modify $\Gamma$ to an algorithm $\Gamma'$ which checks whether $R$ ever exceeds the desired space
bound, and if so, switches to a trivial algorithm which only computes the sum of all weights and
outputs that. This guarantees that we stay within the space bound, and since the probability of
switching to the trivial algorithm is at most $\gamma/2$, the probability that the output of
$\Gamma'$ exceeds $(1 + \delta)\,\text{min-defect}(\sigma, w)$ is at most $\gamma$.

We now prove Theorem 3.1. The first assertion is a direct consequence of the following
proposition.

Proposition 3.1 For all $j \le n + 1$ we have $r(j) \ge s(j)$ and thus
$r(n + 1) \ge \text{min-defect}(\sigma, w)$.

The second part will be proved in the next subsection. The final assertion of Theorem 3.1, showing
the space bound, is deferred to Appendix A.
3.3 Quality of estimate bound of Theorem 3.1

We prove the second assertion of Theorem 3.1, which is the main technical part of the proof. Let
$R^t$ denote the set $R$ after processing $\sigma(t), w(t)$. Observe that the definition of
$p(i, j)$ implies:

Proposition 3.2 For each $i \le t \le n$, $\mathrm{Prob}[i \in R^t] = \prod_{j \in [i,t]} p(i, j) = q(i, t)$.

We need some additional definitions.

• For $I \subseteq [n + 1]$, we denote $[n + 1] - I$ by $\bar{I}$.

• Let $C$ be the index set of some fixed chain having minimum defect. We assume without loss
of generality that $n + 1 \in C$.

• We write $R^t$ for the subset $R$ at the end of step $t$. Note that $R^t \subseteq [t]$. We define
$F^t = [t] - R^t$. An index $i \in R^t$ is said to be remembered at time $t$ and $i \in F^t$ is
said to be forgotten by time $t$.

• Index $i \in C$ is said to be unsafe at time $t$ if $C \cap [i, t] \subseteq F^t$, i.e., every
index of $C \cap [i, t]$ is forgotten by time $t$. We write $U^t$ for the set of indices that are
unsafe at time $t$.

• An index $i \in C$ is said to be unsafe if it is unsafe for some time $t \ge i$ and is safe
otherwise. We denote the set of unsafe indices by $U$. On any execution, the set $U$ is determined
by the sequence $R^1, \ldots, R^n$.

Lemma 3.3 On any execution of the algorithm, $r(n + 1) \le w(\bar{C} \cup U)$.

Proof: We prove by induction on $t$ that if $t \in C$ then
$r(t) \le w(\bar{C}_{\le t-1} \cup U^{t-1})$. Assume $t \ge 1$ and that the result holds for
$j < t$. We consider two cases.

Case i. $U^{t-1} = C_{\le t-1}$. Then $w(\bar{C}_{\le t-1} \cup U^{t-1}) = W(t - 1)$. By
definition $r(t) \le r(0) + W(t - 1) - W(0) = W(t - 1)$, as required.

Case ii. $U^{t-1} \ne C_{\le t-1}$. Let $j$ be the maximum index in $C_{\le t-1} - U^{t-1}$. Since
$j, t \in C$ we must have $j \to t$, and $j$ is remembered at time $t - 1$, so

$$r(t) \le r(j) + W(t - 1) - W(j).$$

Since $j$ is the largest element of $C_{\le t-1} - U^{t-1}$ we have:
$\bar{C}_{\le t-1} \cup U^{t-1} = \bar{C}_{\le j-1} \cup U^{j-1} \cup [j + 1, t - 1]$, and so:

$$r(t) \le r(j) + W(t - 1) - W(j) \le w(\bar{C}_{\le j-1} \cup U^{j-1} \cup [j + 1, t - 1]) \le w(\bar{C}_{\le t-1} \cup U^{t-1}).$$

□

By Lemma 3.3 the output of the algorithm is at most
$w(\bar{C}) + w(U) = \text{min-defect}(\sigma, w) + w(U)$. It now suffices to prove:

$$\mathrm{Prob}[w(U) > \delta w(\bar{C})] \le \gamma/2. \qquad (1)$$

Call an interval $[i, j]$ dangerous if $w(C \cap [i, j]) \le w([i, j])(\delta/(1 + \delta))$. In
particular $[i, i]$ is dangerous iff $i \notin C$. Call an index $i$ dangerous if it is the left
endpoint of some dangerous interval. Let $D$ be the set of all dangerous indices.

We define a sequence $I_1, I_2, \ldots, I_\ell$ of disjoint dangerous intervals as follows. If there
is no dangerous interval then the sequence is empty. Otherwise:

• Let $i_1$ be the smallest index in $D$ and let $I_1$ be the largest dangerous interval with left
endpoint $i_1$.

• Having chosen $I_1, \ldots, I_j$, if $D$ contains no index to the right of all of the chosen
intervals then stop. Otherwise, let $i_{j+1}$ be the least index in $D$ to the right of all chosen
intervals and let $I_{j+1}$ be the largest dangerous interval with left endpoint $i_{j+1}$.

It is obvious from the definition that each successive interval lies entirely to the right of the
previously chosen intervals. Let $B = I_1 \cup \cdots \cup I_\ell$ and let $\bar{B} = [n] - B$. We
now make a series of observations:

Claim 3.4 $\bar{C} \subseteq D \subseteq B$.

Claim 3.5 $w(B) \le w(\bar{C})(1 + \delta)$.

Claim 3.6 $\mathrm{Prob}[U \subseteq B] \ge 1 - \gamma/2$.

By Claims 3.4 and 3.6 we have $U \cup \bar{C} \subseteq B$ with probability at least
$1 - \gamma/2$, and so by Claim 3.5, $w(U \cup \bar{C}) \le w(\bar{C})(1 + \delta)$ with
probability at least $1 - \gamma/2$, establishing (1).
Thus it remains to prove the claims.

Proof of Claim 3.4: If $i \in \bar{C}$ then, as noted earlier, $i$ is dangerous so $i \in D$.

Now suppose $i \in D$. By the construction of the sequence of intervals, there is at least one
interval $I_1$ and the left endpoint $i_1$ is at most $i$. If $i \in I_1 \subseteq B$, we're done.
So assume $i \notin I_1$, and so $i$ is to the right of $I_1$. Let $j$ be the largest index for
which $i$ is to the right of $I_j$. Then $I_{j+1}$ exists and $i_{j+1} \le i$. Since $i$ is not to
the right of $I_{j+1}$, we must have $i \in I_{j+1} \subseteq B$.

Proof of Claim 3.5: For each $I_j$ we have $w(I_j \cap C) \le w(I_j)\delta/(1 + \delta)$. Therefore
$w(I_j \cap \bar{C}) \ge w(I_j)/(1 + \delta)$ and so $(1 + \delta)w(I_j \cap \bar{C}) \ge w(I_j)$.
Summing over $I_j$ we get $(1 + \delta)w(\bar{C}) \ge w(B)$.

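Proof of Claim 3.6: We fix $t \in [n]$ and $i \in \bar{B} \cap [t]$ and show
$\mathrm{Prob}[i \in U^t] \le \frac{\gamma}{4t^3}$. This is enough to prove the claim since we will
then have:

$$\mathrm{Prob}[U \subseteq B] = 1 - \mathrm{Prob}[\bar{B} \cap U \ne \emptyset] \ge 1 - \sum_{t=1}^{n} \mathrm{Prob}[\bar{B} \cap U^t \ne \emptyset]$$

$$\ge 1 - \sum_{t=1}^{n} \sum_{i \in \bar{B} \cap [t]} \mathrm{Prob}[i \in U^t] \ge 1 - \sum_{t=1}^{n} \sum_{i \in \bar{B} \cap [t]} \frac{\gamma}{4t^3} \ge 1 - \gamma/2.$$

So fix $t$ and $i \in \bar{B} \cap [t]$. Since $i \notin B$, the interval $[i, t]$ is not dangerous,
and so $w(C \cap [i, t]) > w([i, t])\delta/(1 + \delta)$, and so

$$w([i, t]) < \frac{1 + \delta}{\delta} w(C \cap [i, t]). \qquad (2)$$

We have $i \in U^t$ only if every index of $C \cap [i, t]$ is forgotten by time $t$. For $j \le t$,
the probability that index $j$ has been forgotten by time $t$ is $1 - q(j, t)$, so
$\mathrm{Prob}[i \in U^t] = \prod_{j \in C \cap [i,t]} (1 - q(j, t))$. If $q(j, t) = 1$ for any of
the multiplicands then the product is 0. Otherwise for each $j \in C \cap [i, t]$:

$$q(j, t) = \frac{1 + \delta}{\delta} \ln(4t^3/\gamma) \frac{w(j)}{W(t) - W(j - 1)} \ge \frac{1 + \delta}{\delta} \ln(4t^3/\gamma) \frac{w(j)}{w([i, t])} \ge \ln(4t^3/\gamma) \frac{w(j)}{w(C \cap [i, t])},$$

where the final inequality uses (2). Therefore:

$$\mathrm{Prob}[i \in U^t] \le \prod_{j \in C \cap [i,t]} (1 - q(j, t)) \le \exp\Big(-\sum_{j \in C \cap [i,t]} q(j, t)\Big) \le \gamma/4t^3,$$

as required to complete the proof of Claim 3.6 and of the second assertion of Theorem 3.1.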
4 Applying AMDP to LIS and LCS

We now show how to apply Theorem 2.1 to LIS and LCS. The application to LIS is quite obvious.
We first set some notation about points in the two-dimensional plane. We will label the axes as 1
and 2, and for a point $z$, $z(1)$ (resp. $z(2)$) refers to the first (resp. second) coordinate of
$z$. We use the standard coordinate-wise partial order on points. So $z < z'$ iff $z(1) < z'(1)$
and $z(2) < z'(2)$.

Proof: (of Theorem 1.1) The input is a stream $x(1), x(2), \ldots, x(n)$. Think of the $i$th
element of the stream as the point $(i, x(i))$. So the input is thought of as a sequence of points.
Note that the points arrive in increasing order of first coordinate. Hence, a chain in this poset
corresponds exactly to an increasing sequence (and vice versa). We set $\gamma = 1/n^{O(1)}$ and
$\rho = n$ in Theorem 2.1. □

The application to LCS is somewhat more subtle. Again, we think of the input as a set of points
in the two-dimensional plane. But this transformation will lead to a blow up in size, which we
counteract by choosing a small value of $\delta$.

Theorem 4.1 Let $x$ and $y$ be two strings of length $n$, where each character occurs at most $k$
times in $y$. Then there is an $O(\delta^{-1} k \log^2 n)$-space algorithm for the asymmetric
setting that outputs a multiplicative $(1 + \delta)$-approximation of $d(x, y)$.

Proof: We show how to convert an instance of approximating $d(x, y)$ in the asymmetric model
to an instance of AMDP. Let $P$ be the set of pairs $\{(i, j) \mid x(i) = y(j)\}$ under the product
order: $(i, j) < (i', j')$ if $i < i'$ and $j < j'$. It is easy to see that common subsequences of
$x$ and $y$ correspond to chains in this poset.

Now we associate to the pair of strings $x, y$ the sequence $\sigma$ consisting of the points in
$P$ listed in lexicographic order. ($(i, j)$ lexicographically precedes $(i', j')$ if $i < i'$, or
if $i = i'$ and $j < j'$.) Note that $\sigma$ can be constructed online given streaming access to
$x$: when $x(i)$ arrives we generate all pairs with first coordinate $i$ in order by second
coordinate. Again it is easy to check that common subsequences of $x$ and $y$ correspond to
$\sigma$-paths as defined in the AMDP. Thus the length of the LCS is equal to the size of the
largest $\sigma$-path. It is not true that $d(x, y)$ is equal to min-defect$(\sigma)$ (here we omit
the weight function, which we take to be identically 1), because the length of $\sigma$ is, in
general, longer than $n$. Given full access to $y$ and a streamed $x$, we have a bound on
$|\sigma|$ of $nk$, since each symbol appears at most $k$ times in $y$.

We now argue that a $(1 + \delta)$-approximation for $d(x, y)$ can be obtained from a
$(1 + \delta/k)$-approximation for AMDP of $P$. Let the length of the longest chain in $P$ be
$\ell$ and the min-defect be $m$. Let $d$ be a shorthand for $d(x, y)$. We have $\ell + m = |P|$
and $\ell + d = n$. The output of AMDP is an estimate $est$ such that
$m \le est \le (1 + \delta/k)m$. We estimate $d$ by $est_d = est + n - |P|$. We show that
$est_d \in [d, (1 + \delta)d]$.

We have $est_d = est + n - |P| \ge m + n - |P| = n - \ell = d$. We can also get an upper bound.

$$est_d = est + n - |P| \le m + n - |P| + \delta m/k = d + \delta(|P| - \ell)/k \le d + \delta(n - \ell) = (1 + \delta)d$$

Hence, we use the parameters $\delta/k$ and $\gamma = 1/n^{O(1)}$ for the AMDP instance created by
our reduction. An application of Theorem 2.1 completes the proof. □
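
The reduction is easy to try out on small inputs. The following Python sketch streams $x$, emits
the pairs $(i, j)$ with $x(i) = y(j)$ in lexicographic order, and converts the min-defect of the
resulting unit-weight sequence into $d(x, y)$ via $est_d = est + n - |P|$. To keep the example
short and checkable it computes the min-defect exactly, with a quadratic longest-chain dynamic
program rather than the small-space algorithm; the function names `matching_pairs`, `precedes`,
and `indel_complement` are hypothetical.

```python
from collections import defaultdict


def matching_pairs(x_stream, y):
    """Yield the pairs (i, j) with x(i) == y(j) in lexicographic order, 1-indexed."""
    positions = defaultdict(list)  # symbol -> increasing list of positions in y
    for j, symbol in enumerate(y, start=1):
        positions[symbol].append(j)
    for i, symbol in enumerate(x_stream, start=1):
        for j in positions.get(symbol, ()):
            yield (i, j)


def precedes(p, q):
    """Product order on pairs: strictly smaller in both coordinates."""
    return p[0] < q[0] and p[1] < q[1]


def indel_complement(x, y):
    """Return d(x, y) = n - LCS(x, y) via the min-defect of the pair sequence."""
    pairs = list(matching_pairs(x, y))
    # longest[t] = length of the longest chain ending at pairs[t] (exact, unit weights).
    longest = []
    for t, q in enumerate(pairs):
        earlier = (longest[i] for i in range(t) if precedes(pairs[i], q))
        longest.append(1 + max(earlier, default=0))
    min_defect = len(pairs) - max(longest, default=0)
    return min_defect + len(x) - len(pairs)


print(indel_complement("abc", "cba"))  # LCS has length 1, so this prints 2
print(indel_complement("aab", "aba"))  # LCS has length 2, so this prints 1
```

Note that `matching_pairs` only needs one symbol of $x$ at a time, which is what makes the
reduction compatible with streaming access to $x$ and full access to $y$.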

5 Deterministic streaming algorithm for LCS 

We now discuss a deterministic $\sqrt{n}$-space algorithm for LCS. This can be used for small sized
alphabets to beat the bound given in Theorem 4.1. For any consistent sequence (CS), the size of
the complement is called the defect. For indices $i, j \in [n]$, $x(i, j)$ refers to the substring
of $x$ from the $i$th character up to the $j$th character. The main theorem is:
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Theorem 5.1 Let 5 > 0. We have strings x and y with full access to y and streaming access to 
x. There is a deterministic one-pass streaming algorithm that computes a (1 + 5) -approximation 
to d(x, y) that uses space. The algorithm performs 0{-\J (<5n)/lnn) updates, each 

taking 0{n 2 Inn/ 5) time. 

The following claim is a direct consequence of the standard dynamic programming algorithm for
LCS [CLRS00].

Claim 5.2 Suppose we are given two strings $x$ and $y$, with complete access to $y$ and a one-pass
stream through $x$. There is an $O(n)$-space algorithm that guarantees the following: when we have
seen $x(1, i)$, we have the lengths of the LCS between $x(1, i)$ and $y(1, j)$, for all $j \in [n]$.
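A minimal Python sketch of the algorithm behind this claim follows. It is the textbook row-by-row LCS recurrence; the function name `stream_lcs_rows` and the 0-based string indexing are our own choices. Only one row of length $n + 1$ is kept between symbols of $x$.

```python
def stream_lcs_rows(x_stream, y):
    """After each symbol of x, yield row with row[j] = LCS(x(1, i), y(1, j))."""
    row = [0] * (len(y) + 1)               # row for the empty prefix of x
    for cx in x_stream:                    # one pass over x
        new = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if cx == y[j - 1]:
                new[j] = row[j - 1] + 1    # extend a match ending at y(j)
            else:
                new[j] = max(row[j], new[j - 1])
        row = new                          # the previous row is discarded
        yield row

rows = list(stream_lcs_rows(iter("abcbdab"), "bdcaba"))
assert rows[-1][-1] == 4                   # LCS of "abcbdab" and "bdcaba" has length 4
```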

Our aim is to implement this algorithm in sublinear space. As before, we maintain a carefully
chosen snapshot of the $O(n)$ space used by the algorithm. In some sense, we only maintain a small
subset of the partial solutions. Although we do not explicitly present it in this fashion, it may be
useful to think of the reduction of Theorem 4.1. We convert an LCS instance into the problem of finding
the longest chain in a set of points $P$. We construct a set of anchor points in the plane, which may not
be in $P$. Our aim is to just maintain the longest chain between pairs of anchor points.

Let $\delta > 0$ be some fixed parameter. We set $\bar{n} = \sqrt{(n \ln n)/\delta}$ and $\mu = (\ln n)/\bar{n} = \sqrt{(\delta \ln n)/n}$.
For each $i \in [n/\bar{n}]$, the set $S_i$ of indices is defined as follows.

$$S_i = \{\lfloor i\bar{n} + b(1 + \mu)^r \rfloor \mid r \ge 0,\ b \in \{-1, +1\}\}$$

For convenience, we treat $\bar{n}$, $n/\bar{n}$, and $(1 + \mu)^r$ as integers$^1$. So we can drop the floors used in the
definition of $S_i$. Note that $|S_i| = O(\mu^{-1} \ln n) = O(\bar{n})$. We refer to the family of sets $\{S_1, S_2, \ldots\}$
by $\mathcal{S}$. This is the set of anchor points that we discussed earlier. Note that they are placed according
to a geometric grid.
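To make the parameter choices concrete, the following sketch computes $\bar{n}$, $\mu$, and the sets $S_i$ for one setting of $n$ and $\delta$. The function names and the decision to discard indices outside $[1, n]$ are assumptions of this illustration. It also checks the identity $\mu \cdot (n/\bar{n}) = \delta$, which follows directly from the two definitions.

```python
import math

def parameters(n, delta):
    """Return (n_bar, mu) with n_bar = sqrt(n ln n / delta), mu = ln n / n_bar."""
    n_bar = math.sqrt(n * math.log(n) / delta)
    mu = math.log(n) / n_bar
    return n_bar, mu

def anchor_set(i, n, n_bar, mu):
    """S_i = { floor(i*n_bar + b*(1+mu)^r) : r >= 0, b in {-1,+1} } within [1, n]."""
    center = i * n_bar
    anchors = set()
    r = 0
    while (1 + mu) ** r <= n:              # larger offsets cannot land in [1, n]
        offset = (1 + mu) ** r
        for b in (-1, 1):
            index = math.floor(center + b * offset)
            if 1 <= index <= n:            # assumption: drop out-of-range indices
                anchors.add(index)
        r += 1
    return sorted(anchors)

n, delta = 10_000, 0.5
n_bar, mu = parameters(n, delta)
sizes = [len(anchor_set(i, n, n_bar, mu)) for i in range(1, int(n / n_bar) + 1)]

assert math.isclose(mu * n / n_bar, delta)
# At most two indices per exponent r, so |S_i| <= 2 (log_{1+mu} n + 1) = O(n_bar).
assert max(sizes) <= 2 * (math.log(n) / math.log(1 + mu) + 1)
```

Close to $i\bar{n}$ the anchors are dense, and they thin out geometrically with distance, which is the "geometric grid" referred to above.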

Definition 5.3 A common subsequence of $x$ and $y$ is consistent with $\mathcal{S}$ if the following happens.
There exists a sequence of indices $\ell_1 \le \ell_2 \le \cdots \le \ell_m$ such that $\ell_i \in S_i$ and if character $x(k)$
($k \in [i\bar{n}, (i + 1)\bar{n}]$) in the common subsequence is matched to $y(k')$, then $k' \in [\ell_i, \ell_{i+1}]$.

We now have a basic claim about the LCS of two strings (proof deferred to Appendix B). This
gives us a simple bound on the defect that we shall exploit. Lemma 5.5 makes an important
argument: the anchor points $\mathcal{S}$ were chosen such that an $\mathcal{S}$-consistent sequence
is "almost" the LCS.

Claim 5.4 Consider an LCS $x(i_1), x(i_2), \ldots, x(i_r)$ and $y(j_1), y(j_2), \ldots, y(j_r)$. Suppose $i_a$ is the smallest
index among $i_1, \ldots, i_r$ with value at least $i$. The defect is at least $|j_a - i|$.

Lemma 5.5 There exists an $\mathcal{S}$-consistent common subsequence of $x$ and $y$ whose defect is at most
$(1 + \delta)d(x, y)$.

Proof: We start with an LCS $L$ of $x$ and $y$ and "round" it to be $\mathcal{S}$-consistent. Let $L$ be
$x(i_1), x(i_2), \ldots, x(i_r)$ and $y(j_1), y(j_2), \ldots, y(j_r)$. Consider some $p \in [n/\bar{n}]$, and let $i_a$ be the smallest
index larger than $p\bar{n}$. Set $\ell_p$ to be the largest index in $S_p$ smaller than $j_a$. We construct a new
common sequence $L'$ by removing certain matches from $L$. Consider a matched pair $(x(i_b), y(j_b))$
in $L$. If $i_b \in [p\bar{n}, (p + 1)\bar{n}]$ and $j_b \le \ell_{p+1}$, then we add this pair to $L'$. Otherwise, it is not added.
Note that $j_b \ge \ell_p$ simply by construction. The new common sequence $L'$ is $\mathcal{S}$-consistent.

It now remains to bound the defect of $L'$. Consider a matched pair $(x(i_b), y(j_b)) \in L$ that is
not present in $L'$. Let $i_b \in [(p - 1)\bar{n}, p\bar{n}]$. This means that $j_b > \ell_p$. Let $i_c$ be the smallest index
larger than $p\bar{n}$. So $\ell_p$ is the largest index in $S_p$ smaller than $j_c$. Let $\ell_p = p\bar{n} + (1 + \mu)^r$. We
have $j_c - p\bar{n} \in [(1 + \mu)^r, (1 + \mu)^{r+1}]$. Since $j_b \in [\ell_p, j_c]$, the total number of possible values for $j_b$ is
at most $(1 + \mu)^{r+1} - (1 + \mu)^r = \mu(1 + \mu)^r$. By Claim 5.4, $d(x, y) \ge j_c - p\bar{n} \ge (1 + \mu)^r$. The number of
characters of $x$ with indices in $[(p - 1)\bar{n}, p\bar{n}]$ that are not in $L'$ is at most $\mu d(x, y)$. The total number
of characters of $L$ not in $L'$ is at most $\mu(n/\bar{n})d(x, y) \le \delta d(x, y)$. □

1 Formally, we need to take floors of these quantities. Our analysis remains identical.

The final claim shows how to update the set of partial LCS solutions consistent with the
anchor points. The proof of this claim and the final proof of the main theorem (which puts everything
together) are given in Appendix B.

Claim 5.6 Suppose we are given the lengths of the largest $\mathcal{S}$-consistent common subsequences
between $x(1, i\bar{n})$ and $y(1, j)$, for all $j \in S_i$. Also, suppose we have access to $x(i\bar{n}, (i + 1)\bar{n})$ and $y$. Then,
we can compute the lengths of the largest $\mathcal{S}$-consistent common subsequences between $x(1, (i + 1)\bar{n})$ and
$y(1, j)$ (for all $j \in S_{i+1}$) using $O(\bar{n})$ space. The total running time is $O(n\bar{n}^2)$.
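The update in this claim can be sketched as follows. This is an illustration under several assumptions of ours, not the paper's exact procedure: positions are 0-based cut points $0, \ldots, n$ in $y$, anchors are clipped to that range, and the names `advance`, `consistent_lcs`, and `geometric_anchors` are hypothetical. For simplicity the function takes $x$ as a whole string, but it touches $x$ one block at a time, in order.

```python
import math

def lcs_length(x, y):
    """Reference LCS length (textbook dynamic program)."""
    row = [0] * (len(y) + 1)
    for cx in x:
        new = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            new[j] = row[j - 1] + 1 if cx == y[j - 1] else max(row[j], new[j - 1])
        row = new
    return row[-1]

def advance(best, next_anchors, block, y):
    """Move from values stored at the current anchors to values at the next ones.

    best[j'] is the stored length for the x-prefix read so far versus y[:j'].
    For each new anchor j, y is scanned backwards from j; after k characters
    row[-1] is the LCS of the block and y[j-k:j], held in O(len(block)) space.
    """
    rev_block = block[::-1]
    new_best = {}
    for j in next_anchors:
        row = [0] * (len(rev_block) + 1)
        value = best.get(j)                 # empty window: the block matches nothing
        for k in range(1, j + 1):
            ch = y[j - k]
            new = [0] * (len(rev_block) + 1)
            for a in range(1, len(rev_block) + 1):
                if rev_block[a - 1] == ch:
                    new[a] = row[a - 1] + 1
                else:
                    new[a] = max(row[a], new[a - 1])
            row = new
            if (j - k) in best:             # j' = j - k is a current anchor
                candidate = best[j - k] + row[-1]
                value = candidate if value is None else max(value, candidate)
        if value is not None:
            new_best[j] = value
    return new_best

def consistent_lcs(x, y, block_len, anchors):
    """Longest common subsequence whose cuts in y lie in anchors(0), anchors(1), ..."""
    best = {j: 0 for j in anchors(0)}
    for i in range(math.ceil(len(x) / block_len)):
        block = x[i * block_len:(i + 1) * block_len]
        best = advance(best, anchors(i + 1), block, y)
    return max(best.values(), default=0)

def geometric_anchors(n, block_len, mu):
    """Anchors i*block_len +/- floor((1+mu)^r), clipped to [0, n] (our assumption)."""
    def anchors(i):
        center, found, r = i * block_len, set(), 0
        while True:
            step = math.floor((1 + mu) ** r)
            found.add(min(n, max(0, center - step)))
            found.add(min(n, max(0, center + step)))
            if step >= n:
                return sorted(found)
            r += 1
    return anchors

x, y = "abcbdabcab", "bdcababdca"
n, delta = len(x), 0.5
block_len = max(1, round(math.sqrt(n * math.log(n) / delta)))
mu = math.log(n) / block_len
approx = consistent_lcs(x, y, block_len, geometric_anchors(n, block_len, mu))
exact = consistent_lcs(x, y, block_len, lambda i: range(n + 1))
assert exact == lcs_length(x, y)    # allowing every cut makes the recurrence exact
assert approx <= exact              # restricting the cuts can only lose matches
```

Only the table `best`, indexed by the current anchors, survives from one block to the next. Each new anchor costs one backward scan of $y$ against the block, which matches the $O(|S_{i+1}| n \bar{n})$ running time of the claim.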
A The space bound of Theorem 3.1

The following claim shows that the probability that $|R_t|$ exceeds the space bound is at most $\gamma/2n$. A
union bound over all $t$ proves the third assertion of Theorem 3.1.

Claim A.1 Let $M = $ § $\ln(4n^3/\gamma) \ln(2\rho)$. Fix $t \in [n]$. Then $\mathrm{Prob}[|R_t| \ge e^2 M] \le \gamma/2n$.

Proof: For $i \in [t]$ let $Z_i = 1$ if $i \in R_t$ and 0 otherwise. Then $|R_t| = \sum_{i \le t} Z_i$. Let $\mu = E[|R_t|]$.
Below we show that $\mu \le M$. We need the following tail bound (which is equivalent to the bound
of [AS00], Theorem A.12):

Proposition A.2 Let $Z_1, \ldots, Z_m$ be independent 0/1-valued random variables, let $Z = \sum_i Z_i$, and
let $\mu = E[Z]$. Then for any $C > 0$, $\mathrm{Prob}[Z \ge C] \le (e\mu/C)^C$.
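As a sanity check of this tail bound, the sketch below computes the exact distribution of a sum of independent 0/1 variables by convolution and compares each tail with $(e\mu/C)^C$. The probabilities are arbitrary illustrative values and `tail_probability` is a hypothetical helper name.

```python
import math

def tail_probability(probs, threshold):
    """Exact Prob[Z >= threshold] for Z = sum of independent Bernoulli(probs[i])."""
    dist = [1.0]                            # dist[s] = Prob[partial sum == s]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1 - p)        # the new variable is 0
            nxt[s + 1] += mass * p          # the new variable is 1
        dist = nxt
    return sum(mass for s, mass in enumerate(dist) if s >= threshold)

probs = [0.05, 0.1, 0.2, 0.05, 0.3, 0.1, 0.02, 0.15]   # illustrative values only
mu = sum(probs)                                         # mu = E[Z]
for C in range(1, len(probs) + 1):
    assert tail_probability(probs, C) <= (math.e * mu / C) ** C
```

The bound is trivial while $C \le e\mu$ and only becomes informative for larger $C$, which is why the proof applies it with $C = e^2 M$.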

Applying this proposition with $C = e^2 M$ gives $\mathrm{Prob}[|R_t| \ge e^2 M] \le e^{-C}$, which is at most $\gamma/2n$
(with a lot of room to spare). It remains to upper bound $\mu$. Denoting the upper bound by $M$, we
conclude that:

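$$\mu = \sum_{i=1}^{t} E[Z_i] = \sum_{i=1}^{t} \mathrm{Prob}[i \in R_t] \le \frac{M}{\ln(2\rho)} \sum_{i=1}^{t} \frac{w(i)}{W(t) - W(i-1)}$$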

We note the following fact (easily proved by induction on $r$):

Proposition A.3 For $r \ge 1$, $\sum_{i=r}^{t} w(i)/(W(t) - W(i-1)) \le \ln\left(2(W(t) - W(r-1))/w(t)\right)$.

Thus $\sum_{i=1}^{t} w(i)/(W(t) - W(i-1)) \le \ln(2W(t)/w(t)) \le \ln(2\rho)$, and so $\mu \le M$. This completes the
proof. □

B Proofs from Section 5

We first prove another claim from which the proof of Claim 5.4 follows.

Claim B.1 Given a common subsequence $x(i_1), x(i_2), \ldots, x(i_r)$ and $y(j_1), y(j_2), \ldots, y(j_r)$, the defect
is at least $\max_{k \le r}(|i_k - j_k|)$.

Proof: (of Claim B.1) Assume wlog that $i_k \ge j_k$. Since the $i_k$th character of $x$ is matched to the
$j_k$th character of $y$, the length of this common subsequence is at most $LCS(x(1, i_k), y(1, j_k)) +
LCS(x(i_k + 1, n), y(j_k + 1, n))$. This can be bounded above trivially by $j_k + (n - i_k) = n - (i_k - j_k)$.
Hence the defect is at least $i_k - j_k$. Repeating over all $k$, we complete the proof. □

Proof: (of Claim 5.4) The defect is at least $|j_a - i_a|$ (by Claim B.1) and is also at least $|i_a - i|$
(by definition of $i_a$). If either $j_a \in [i, i_a]$ or $i \in [j_a, i_a]$, then the defect is certainly at least $|j_a - i|$.
Suppose neither of these is true. Then $j_a > i_a \ge i$. Let us focus on the characters of $x$ that are not
matched. No character of $x$ with index in $[i, i_a)$ is matched. The characters in $(i_a, n]$ can only be
matched to characters of $y$ in $(j_a, n]$ (since $(x(i_a), y(j_a))$ is a match). So the number of characters in
$(i_a, n]$ that are not matched is at least $(n - i_a) - (n - j_a) = (j_a - i_a)$. So the number of unmatched
characters in $x$ is at least $j_a - i$. □
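Both Claim B.1 and Claim 5.4 can be checked by brute force on a small pair of equal-length strings. The sketch below is ours: `lcs_matching` is a hypothetical helper that recovers one LCS by the usual traceback, and indices are 0-based, which does not change any of the differences involved.

```python
def lcs_matching(x, y):
    """Return one LCS as a list of matched index pairs (i, j), 0-based."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                  # standard traceback from the corner
        if x[i - 1] == y[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

x, y = "abcbdabcab", "bdcababdca"
n = len(x)
pairs = lcs_matching(x, y)
defect = n - len(pairs)

# Claim B.1: the defect is at least every displacement |i_k - j_k|.
assert all(defect >= abs(i - j) for i, j in pairs)

# Claim 5.4: for each index i, let i_a be the first matched x-index >= i.
for i in range(n):
    later = [(ia, ja) for ia, ja in pairs if ia >= i]
    if later:
        ia, ja = later[0]
        assert defect >= abs(ja - i)
```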
Proof: (of Claim 5.6) Consider some $j \in S_{i+1}$, and set $\tilde{x} = x(i\bar{n}, (i + 1)\bar{n})$. We wish to compute
the largest $\mathcal{S}$-consistent CS between $x(1, (i + 1)\bar{n})$ and $y(1, j)$. Suppose we look at the portion
of this CS in $x(1, i\bar{n})$. This forms an $\mathcal{S}$-consistent sequence between $x(1, i\bar{n})$ and $y(1, j')$, for some $j' \in S_i$.
The remaining portion of the CS is just the LCS between $\tilde{x} = x(i\bar{n}, (i + 1)\bar{n})$ and $y(j', j)$. Hence,
given the LCS length of $\tilde{x}$ and $y(j', j)$, for all $j' \in S_i$, we can compute the length of the largest
$\mathcal{S}$-consistent CS between $x(1, (i + 1)\bar{n})$ and $y(1, j)$. This is obtained by just maximizing over all
possible $j'$.

We now apply Claim 5.2. We have $\tilde{x}$ in hand, and stream in reverse order through $y(1, j)$.
Using $O(\bar{n})$ space, we can compute all the LCS lengths desired. This gives the length of the largest
$\mathcal{S}$-consistent CS that ends at $y(j)$. This can be done for all $y(j)$, $j \in S_{i+1}$. The total running time is
$O(|S_{i+1}| n \bar{n}) = O(n\bar{n}^2)$. □

Proof: (of Theorem 5.1) Our streaming algorithm will compute the length of the longest $\mathcal{S}$-
consistent CS. Consider the index $i\bar{n}$. Suppose we have currently stored the lengths of the largest
$\mathcal{S}$-consistent CS between $x(1, i\bar{n})$ and $y(1, j)$, for all $j \in S_i$. This requires space $O(|S_i|) = O(\bar{n})$.
By Claim 5.6, we can compute the corresponding lengths for $S_{i+1}$ using an additional $O(\bar{n})$ space.
Hence, at the end of the stream, we will have the length (and defect) of the longest $\mathcal{S}$-consistent CS.
Lemma 5.5 tells us that this defect is a $(1 + \delta)$-approximation to $d(x, y)$. The space bound is $O(\bar{n})$.
The number of updates is $O(n/\bar{n}) = O(\sqrt{(\delta n)/\ln n})$, and the time for each update is $O(n\bar{n}^2) =
O((n^2 \ln n)/\delta)$. □